Air Quality Indicators and Disease Prevalence Across the US

Final Project
Data Science 1 with R (STAT 301-1)

Author

Cassie Lee

Published

November 27, 2023

Introduction

Air pollution is the presence of sufficient quantities of contaminants in the atmosphere for a duration that is long enough to cause harm to human health.1 Air pollution mainly enters the body through the lungs, and while it mostly impacts the heart, lungs, and brain, it has the potential to affect other organs as well by traveling through the bloodstream.2

In the EDA, I explore the relationships between air quality indicators, certain lung diseases, and how sociodemographic vulnerabilities affect the relationships between air quality and health in the United States (excluding territories). The goal of this analysis is to build a deeper understanding of how various air pollutants impact health across the United States.

I queried the data at the county level and downloaded it from the CDC National Environmental Public Health Tracking Network interactive data explorer.3 This network was built to centralize environmental health data at the national, state, and county levels across the United States.4

The three central questions I explored in this analysis were:

  1. How are various air and environmental quality indicators related to each other?

  2. What are the most important indicators for determining the prevalence of asthma, cancer, and chronic obstructive pulmonary disease?

  3. How do sociodemographics like age, gender, race, and socioeconomic vulnerability affect the relationships between lung disease and air quality?

Data overview & quality

The air quality indicators I selected include the days over the ozone standard, the percent of days over the PM 2.5 standard, and benzene, formaldehyde, acetaldehyde, carbon tetrachloride, and 1,3-butadiene pollution. The environmental quality indicators I selected include the percent of people living near highways, the percent of public schools near highways, access to parks, and methods of transportation to work (walking, biking, driving alone, carpooling, public transportation, and none).

The indicators I selected for the prevalence of lung diseases include the crude prevalence of adult and child asthma, the crude and age adjusted rates of emergency department visits for asthma, age adjusted rates of lung and bronchus cancer, and the crude and age adjusted rates of chronic obstructive pulmonary disease.

The sociodemographic indicators I selected include age, gender, race, and the social vulnerability index.5 I compared the sociodemographic characteristics of each county to those of other counties in the United States and used this comparison to identify counties with comparatively higher percentages of vulnerable groups (Appendix A). I also identified the majority race demographic in each county.
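As a rough illustration of the two county-level flags described above, the following Python sketch (illustrative only, not the code used for this project) marks counties whose share of a group is high relative to other counties and picks the largest group in a county. The function names, the group labels, and the 75th percentile cutoff are all assumptions; the actual rule is the one described in Appendix A.

```python
from statistics import quantiles


def flag_vulnerable(values, cutoff_percentile=75):
    """Flag counties whose value is above a percentile of all counties.

    The 75th percentile default is an assumed cutoff, not the report's.
    """
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cut_points = quantiles(values, n=100, method="inclusive")
    threshold = cut_points[cutoff_percentile - 1]
    return [value > threshold for value in values]


def majority_group(shares):
    """Return the group with the largest population share in one county."""
    return max(shares, key=shares.get)


# Hypothetical percent of residents aged 65+ in five counties.
pct_over_65 = [10, 12, 14, 16, 30]
print(flag_vulnerable(pct_over_65))

# Hypothetical population shares for a single county.
print(majority_group({"group_a": 0.6, "group_b": 0.3, "group_c": 0.1}))
```

A relative cutoff like this always flags some counties, however small the absolute differences, which matches the comparative framing of the vulnerability definition.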

If available, I downloaded all data at the county level for 2018. Crude rates of child asthma and age adjusted rates of lung and bronchus cancer were only available at the state level. I downloaded data for the usual method of transportation to work for the time period of 2017 to 2021.

The final merged dataset includes 42 variables and 3144 observations matching each county and Washington DC. There are 4 identifying variables, 5 factor variables, 32 numeric variables, and one simple features geometry variable for mapping.

With the exception of Rio Arriba County, New Mexico, all counties have complete sociodemographic information; Rio Arriba County is missing data for the social vulnerability index. The prevalence of adult asthma is complete across all observations, but 1402 observations are missing for the prevalence of child asthma and 1789 observations are missing for crude and age adjusted emergency department visits for asthma. The majority of counties do not have information about the days over the ozone and PM 2.5 air quality standards. 1567 observations are missing data for the percent of public schools near highways. All other variables are either complete or missing at most 1 observation.
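A missingness summary of this kind can be produced by counting empty entries per variable. The Python sketch below is a minimal illustration, not the project's code; the variable names and values are made up, and missing entries are assumed to be represented as `None`.

```python
def count_missing(rows):
    """Count missing (None) entries per variable across county records."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (1 if value is None else 0)
    return counts


# Hypothetical county records; None marks a missing value.
counties = [
    {"adult_asthma": 9.1, "ozone_days": None, "svi": 0.42},
    {"adult_asthma": 8.7, "ozone_days": 3, "svi": None},
    {"adult_asthma": 10.2, "ozone_days": None, "svi": 0.15},
]
missing = count_missing(counties)
for name, n_missing in missing.items():
    print(f"{name}: {n_missing} missing of {len(counties)}")
```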

Explorations

Indicators of air and environmental quality

To address the first central question, I used univariate analysis to understand the distribution of the indicators of air and environmental quality. I also used bivariate analysis to understand if and how certain indicators were related to each other.

Air quality: ozone and PM 2.5

As seen in Figure 1, the distributions of days over the ozone and particulate matter size 2.5 microns (PM 2.5) standards are both skewed right by counties with far more days over the standard than is typical. For both ozone and PM 2.5, most counties experienced no more than 20 days over the standard. However, the right skew is considerably stronger for days over the ozone standard than for days over the PM 2.5 standard.

Figure 1: Distribution of days over air quality standards.
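The comparison of skew above is visual, but it can also be quantified. The Python sketch below computes the moment-based sample skewness (third central moment divided by the second central moment to the power 1.5), which is positive for right-skewed data. It is an illustration under assumed data, not the project's code; the day counts are made up.

```python
def skewness(values):
    """Moment-based sample skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical days over each standard for five counties.
ozone_days = [0, 0, 1, 2, 40]
pm25_days = [0, 1, 2, 3, 8]
print(skewness(ozone_days), skewness(pm25_days))
```

With these made-up values both statistics are positive and the ozone value is larger, which is the pattern the figure is described as showing.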

Given that the distributions are so heavily skewed right, I was interested to see how the days over the ozone and PM 2.5 standards are distributed geographically across the US. Although most of the data is missing, Figure 2 shows that Southern California has high levels of ozone pollution and Central California has high levels of PM 2.5 pollution. For PM 2.5, the counties with the most days over the standard align with the California wildfire incident map for 2018.6

Figure 2: Distribution of days over the ozone and PM 2.5 standard across the United States, excluding Hawaii and Alaska.

Figure 3 shows the relationship between the days over the ozone and PM 2.5 standard. There is a positive relationship between the two air quality indicators, indicating that counties with worse ozone pollution generally also have worse PM 2.5 pollution. This is consistent with research about the sources of ozone and PM 2.5 pollution, as they can both originate from nitrogen oxides from power plants, industrial pollution, and automobiles.7

However, a large number of counties report 0 days over the PM 2.5 standard while having several days over the ozone standard, and several counties report 0 days over the ozone standard while having several days over the PM 2.5 standard. This is consistent with research showing that some sources produce one pollutant but not the other. For example, construction sites, unpaved roads, fields, smokestacks, and fires produce PM 2.5 pollution, but not ozone pollution.8

Figure 3: Relationship between days over the PM 2.5 standard and days over the ozone standard excluding outliers (over 50 days over one or both standards).

Air quality: benzene, formaldehyde, acetaldehyde, carbon tetrachloride, and 1,3-butadiene

The other group of air quality indicators I was interested in was benzene, formaldehyde, acetaldehyde, carbon tetrachloride, and 1,3-butadiene concentrations. Figure 4 shows the distribution of these air pollutants. Benzene, formaldehyde, acetaldehyde, and 1,3-butadiene are skewed right, indicating that there are several counties that have unusually high levels of these pollutants. This is expected, as counties with unusually high industrial pollution would produce this distribution. However, carbon tetrachloride is skewed left, indicating that there are several counties that have unusually low levels of this pollutant. One reason that this distribution may differ from the others is that carbon tetrachloride is not naturally occurring, while the other pollutants are. Thus, counties with unusually low levels of carbon tetrachloride may be counties that have never had high levels of carbon tetrachloride exposure and are therefore capable of having extremely low values.9

Figure 4: Distribution of five air pollutants.

Then, I explored how these five air pollutants were correlated with each other. Figure 5 shows a correlation matrix of these air pollutants. Formaldehyde and acetaldehyde are highly correlated with each other, while the rest of the pollutants are somewhat or barely correlated with each other. Carbon tetrachloride and benzene are somewhat correlated, and 1,3-butadiene is somewhat correlated with both formaldehyde and acetaldehyde. Benzene is barely correlated with formaldehyde and acetaldehyde. Correlation between two pollutants may indicate shared sources that emit both pollutants at the same time.

Figure 5: Correlation matrix of five air pollutants.
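For reference, a correlation matrix like the one in Figure 5 holds one pairwise correlation coefficient for every pair of variables. The Python sketch below assumes Pearson correlation (the report does not state which coefficient was used) and uses made-up concentrations; it is an illustration, not the project's code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical pollutant concentrations for four counties.
pollutants = {
    "formaldehyde": [1.0, 1.5, 2.0, 2.5],
    "acetaldehyde": [0.9, 1.4, 2.1, 2.4],
    "benzene": [0.5, 0.2, 0.6, 0.3],
}
matrix = correlation_matrix(pollutants)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The matrix is symmetric with ones on the diagonal, so only the entries on one side of the diagonal carry information.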

Air quality: combined

After exploring how these five pollutants were correlated with each other, I wanted to see if they were correlated with ozone and PM 2.5 pollution. Figure 6 shows that ozone and PM 2.5 pollution are not particularly correlated with the other five pollutants. This is likely because the five chemicals come almost strictly from industrial pollution, while ozone and PM 2.5 pollution can have significant non-industrial sources, such as automobiles.

Figure 6: Correlation matrix of all air quality indicators.

Environmental quality

After bivariate analysis of air quality indicators against environmental quality indicators, I decided not to move forward with the environmental indicators because they were not particularly predictive of air quality. For example, I had originally suspected that the percent of the population living near a highway would be predictive of ozone levels, but Figure 7 shows that it is not. This lack of relationship between environmental quality and air quality suggested that continuing to explore indicators of environmental quality would not help me answer my three main questions.

Figure 7: The relationship between the days over the ozone standard and the percent of people living within 150 m of a highway is an example of how environmental quality indicators were not particularly predictive of air quality indicators.

Lung disease and air quality indicators

Once I had an understanding of how the air quality indicators were related to each other, I was interested in seeing how air pollution and various lung diseases were related.

Asthma

Figure 8 shows that although ozone and PM 2.5 pollution are known to aggravate lung diseases such as asthma, there is no particularly clear relationship between emergency department visits for asthma and ozone or PM 2.5 pollution.10 It is possible that, given better predictions and access to air quality information online, individuals with asthma are better able to avoid long exposures to high ozone and PM 2.5 levels, allowing them to avoid aggravating their asthma.

Figure 8: The crude rate of emergency department visits for asthma per 10 K population as a function of days over the ozone and PM 2.5 standard.

However, Figure 9 shows a clear positive relationship between the prevalence of asthma and exposure to the pollutants formaldehyde and acetaldehyde for both adults and children. Since the relationship between asthma and these two pollutants holds in both childhood and adulthood, I suspect that exposure to these pollutants in childhood may be associated with the development of asthma, which is then carried into adulthood. The relationship between childhood asthma and formaldehyde exposure has been supported by various studies.11 There is limited and conflicting evidence about the long term effects of acetaldehyde exposure, so the positive relationship between childhood asthma and acetaldehyde exposure in these graphs may just be a result of the extremely strong correlation between formaldehyde and acetaldehyde.

Figure 9: The prevalence of adult and child asthma (percent of population) as a function of formaldehyde and acetaldehyde concentrations.

Lung and bronchus cancer

To explore how lung and bronchus cancer was associated with the air quality indicators, I used a correlation matrix to identify potentially interesting relationships to explore. Figure 10 shows that the days over the ozone and PM 2.5 standards were not positively correlated with cancer, but it is important to note that there was a lot of missing data for these two indicators. On the other hand, there are relatively strong positive correlations between lung and bronchus cancer and the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride.

Figure 10: Correlation matrix of lung and bronchus cancer and air quality indicators.

Figure 11 visualizes the relationship between the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride and the prevalence of lung and bronchus cancer. As expected, formaldehyde and acetaldehyde have very similar relationships with the prevalence of lung and bronchus cancer. However, it is surprising that the relationship between lung and bronchus cancer and carbon tetrachloride has such a high correlation and a relatively high slope, because this pollutant primarily affects the liver, kidneys, and central nervous system.12 Its main carcinogenic properties affect the liver, not the lungs.13 Figure 10 shows that carbon tetrachloride is most strongly correlated with benzene, but benzene is not strongly correlated with cancer risk. Thus, carbon tetrachloride is likely correlated with a different carcinogenic air pollutant that affects the respiratory system and was not explored here.

Figure 11: The prevalence of lung and bronchus cancer per 100 K population as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride.

Chronic obstructive pulmonary disease

Finally, to explore the relationship between chronic obstructive pulmonary disease (COPD) and air quality, I created another correlation matrix to identify potentially interesting relationships. Figure 12 and Figure 13 show that, as with lung and bronchus cancer, formaldehyde, acetaldehyde, and carbon tetrachloride pollution were strongly correlated with COPD. However, unlike the relationships for lung and bronchus cancer, formaldehyde and acetaldehyde have higher slopes than carbon tetrachloride. This is consistent with studies showing that formaldehyde exposure through inhalation increases the risk of COPD.14 The relationship between COPD and acetaldehyde is likely just a result of the strong correlation between formaldehyde and acetaldehyde because acetaldehyde is not known to have chronic health effects.

Figure 12: Correlation matrix of COPD and air quality indicators.

Figure 13: Age adjusted percentage of COPD as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride.
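The slopes being compared in these figures are slopes of fitted trend lines. The Python sketch below assumes an ordinary least-squares line (the report does not state the fitting method) and uses made-up values; it is an illustration, not the project's code.

```python
def least_squares_fit(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical pollutant concentrations and COPD percentages for four counties.
formaldehyde = [1.0, 1.5, 2.0, 2.5]
copd_percent = [5.0, 6.1, 6.9, 8.0]
slope, intercept = least_squares_fit(formaldehyde, copd_percent)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Slopes fitted against different pollutants are only directly comparable when the pollutants are measured on the same scale, so a steeper line for one pollutant does not by itself imply a stronger effect.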

Sociodemographic effects on lung disease and air quality

Demographic age vulnerability

Given that poor air quality generally affects the young and the old, I explored how the distribution of age affected the relationship between lung disease and air quality. I identified counties with a relatively high proportion of young people or a relatively high proportion of older people as vulnerable.

Although Figure 8 did not show a clear relationship between emergency department visits and days over the ozone and PM 2.5 standards, Figure 14 highlights how counties with a high population of young or old individuals do in fact see an increase in emergency department visits for asthma with poor air quality. For counties that are not vulnerable by age demographics, emergency department visits still do not have a clear relationship with ozone. However, for PM 2.5 pollution, emergency department visits decrease as the number of days over the PM 2.5 standard increases. This may still reflect the tendency of individuals to use air quality forecasts to limit exposure.

Figure 14: The crude rate of emergency department visits for asthma per 10 K population as a function of days over the ozone and PM 2.5 standard, disaggregated by demographic age vulnerability.

I was also interested in exploring how vulnerability by age demographics would affect the relationship between the prevalence of cancer and air pollutants. Figure 15 shows how the relationship between lung and bronchus cancer and air pollutants is the same across age vulnerabilities. However, counties that are vulnerable by age demographics generally have a lower prevalence of cancer. This is likely because children typically do not have enough time to develop lung and bronchus cancer at a young age, and people who have had lung and bronchus cancer may not live until older ages, so they would not be included in the population statistics.

Figure 15: The prevalence of lung and bronchus cancer per 100 K population as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride, disaggregated by demographic age vulnerability.

Figure 16 shows that demographic age vulnerability had a similar effect on the relationship between COPD and air pollutants. However, it seems that for counties that did not have a high population of young and old individuals, the effect of formaldehyde and acetaldehyde pollution on COPD prevalence is greater. This is also likely another effect of how chronic diseases develop and affect age distributions.

Figure 16: Age adjusted percentage of COPD as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride, disaggregated by demographic age vulnerability.

Gender vulnerability

Given that there are often differences in exposure to environmental hazards between men and women, I was interested to see if counties with a relatively higher proportion of women had a different relationship with asthma and air quality than counties with a relatively lower proportion of women. Figure 17 shows that in general, the effect of poor air quality on emergency department visits for asthma was larger for counties with a relatively higher proportion of women. The exception to this is PM 2.5, and this could be due to gendered differences in risk perception for poor air quality.15

Figure 17: The crude rate of emergency department visits for asthma per 10 K population as a function of days over the ozone and PM 2.5 standard, disaggregated by gender vulnerability.

Appendix

A. Identifying sociodemographic vulnerabilities